As I mentioned in the README, I am concerned by the phenomenon of workers leaving a company, for several reasons: the future absence of a key player in the company, a non-amortised investment in personnel training, and so on. I therefore searched several dataset repositories for a dataset suited to analysing this phenomenon. I found this dataset on Kaggle, and I think it fulfils my needs: it has both categorical and numerical data to examine, and it addresses this exact topic.
The variables that make up this dataset are the following:
Age, worker’s age.
Attrition, target variable; it indicates whether the worker leaves his/her company.
BusinessTravel, how often a worker travels due to work reasons.
DailyRate, daily rate of the employee.
Department, current department of the employee.
DistanceFromHome, distance (in miles) between the worker’s home and workplace.
Education, level of studies of the employee.
EducationField, choice of study of the employee.
EmployeeCount, constant 1.
EmployeeNumber, primary key of the observation.
EnvironmentSatisfaction, ordinal variable from 1 to 4 for the employee’s satisfaction with his/her work environment.
Gender, binary gender of the employee.
HourlyRate, hourly rate.
JobInvolvement, ordinal variable from 1 to 4 for employee involvement.
JobLevel, ordinal variable from 1 to 5 for the employee’s level in the company hierarchy.
JobRole, employee role at the company.
JobSatisfaction, ordinal variable from 1 to 4 for the employee’s job satisfaction.
MaritalStatus, marital status of the employee.
MonthlyIncome, monthly income of the employee.
MonthlyRate, monthly rate of the employee.
NumCompaniesWorked, how many companies the employee has worked for.
Over18, constant (‘Y’ for every row).
OverTime, boolean to check if the employee does overtime.
PercentSalaryHike, percentage of the employee’s last salary increase.
PerformanceRating, ordinal variable from 1 to 4 with the perceived performance of the employee.
RelationshipSatisfaction, ordinal variable from 1 to 4 for the employee’s relationship satisfaction at work.
StandardHours, constant 80.
StockOptionLevel, ordinal variable from 0 to 3 for the ease with which the employee can acquire stock of the company.
TotalWorkingYears, total number of years the employee has worked over his/her career.
TrainingTimesLastYear, number of times the employee received training last year.
WorkLifeBalance, ordinal variable from 1 to 4 with the employee’s level of work/life balance.
YearsAtCompany, number of years the employee has worked at the company.
YearsInCurrentRole, number of years the employee has worked in his/her current role.
YearsSinceLastPromotion, number of years since the employee’s last promotion.
YearsWithCurrManager, number of years the employee has worked with the same manager.
# Library loading
packages <- c('dplyr', 'ggplot2', 'reshape2', 'grid', 'gridExtra',
              'plotly', 'caTools', 'e1071', 'randomForest', 'caret',
              'C50')  # C50 is required later for the decision tree model
for (pkg in packages){
  if(!require(pkg, character.only = TRUE)){
    install.packages(pkg, dependencies = c("Depends", "Imports"),
                     repos = 'http://cran.es.r-project.org')
    require(pkg, character.only = TRUE)
  }
}
# Set the seed so results can be replicated
set.seed(1234)
# Dataset loading
dataset <- read.csv('../data/WA_Fn-UseC_-HR-Employee-Attrition.csv')
# Dropping the columns with no analytical value (constants and the primary key)
col_drops <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
for (column in col_drops){
dataset[,column] <- NULL
}
# Conversion of strings to Factors in categorical columns
factores <- c("Attrition",
"BusinessTravel",
"Department",
"EducationField",
"Gender",
"JobRole",
"MaritalStatus",
"OverTime")
for(f in factores){
dataset[, f] <- as.factor(dataset[, f])
}
First of all, we check whether the dataset contains null values:
sprintf("Number of NULL values in the dataset: %i", sum(is.na(dataset)))
## [1] "Number of NULL values in the dataset: 0"
sprintf("Number of ' ' values in the dataset: %i", sum(dataset == ' '))
## [1] "Number of ' ' values in the dataset: 0"
sprintf("Number of '?' values in the dataset: %i", sum(dataset == '?'))
## [1] "Number of '?' values in the dataset: 0"
sprintf("Number of '' values in the dataset: %i", sum(dataset == ''))
## [1] "Number of '' values in the dataset: 0"
As we can see, no null or placeholder values are detected in this dataset, which suggests that data completeness is high or total, unless further problems are detected later.
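Beyond the global counts, the same check can be run per column, which would pinpoint where any missing values are located. A minimal sketch, assuming the `dataset` object loaded above:

```r
# Per-column count of NA values; every entry should be zero
colSums(is.na(dataset))

# Per-column count of empty strings, covering the character/factor columns
sapply(dataset, function(col) sum(col == '', na.rm = TRUE))
```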
In order to have solid ground for our analysis, we describe the values and data types of the dataset’s columns:
str(dataset)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
summary(dataset)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EnvironmentSatisfaction Gender HourlyRate
## Human Resources : 27 Min. :1.000 Female:588 Min. : 30.00
## Life Sciences :606 1st Qu.:2.000 Male :882 1st Qu.: 48.00
## Marketing :159 Median :3.000 Median : 66.00
## Medical :464 Mean :2.722 Mean : 65.89
## Other : 82 3rd Qu.:4.000 3rd Qu.: 83.75
## Technical Degree:132 Max. :4.000 Max. :100.00
##
## JobInvolvement JobLevel JobRole JobSatisfaction
## Min. :1.00 Min. :1.000 Sales Executive :326 Min. :1.000
## 1st Qu.:2.00 1st Qu.:1.000 Research Scientist :292 1st Qu.:2.000
## Median :3.00 Median :2.000 Laboratory Technician :259 Median :3.000
## Mean :2.73 Mean :2.064 Manufacturing Director :145 Mean :2.729
## 3rd Qu.:3.00 3rd Qu.:3.000 Healthcare Representative:131 3rd Qu.:4.000
## Max. :4.00 Max. :5.000 Manager :102 Max. :4.000
## (Other) :215
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## Divorced:327 Min. : 1009 Min. : 2094 Min. :0.000 No :1054
## Married :673 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
## Single :470 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
## Min. :11.00 Min. :3.000 Min. :1.000 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:0.0000
## Median :14.00 Median :3.000 Median :3.000 Median :1.0000
## Mean :15.21 Mean :3.154 Mean :2.712 Mean :0.7939
## 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :25.00 Max. :4.000 Max. :4.000 Max. :3.0000
##
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
##
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
##
We can combine the summary table with density plots of the continuous variables, in order to detect outliers more easily and perceive the distribution of the different characteristics.
melt.dataset <- melt(dataset)
ggplot(data = melt.dataset, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
We observe that most distributions are concentrated toward the lower end of their range (i.e. right-skewed), which makes sense, as most workers receive the lowest compensation levels within organizations. At the same time, we can clearly see that most workers live fairly near their workplaces, presumably for convenience. On the other hand, some variables are roughly uniform (such as the Rate variables) and some follow an approximately normal distribution (such as Age, which is normal in a wide range of datasets).
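This concentration at the lower end of each range can also be quantified. A sketch using `e1071::skewness` (e1071 is loaded above for the SVM model), assuming the `dataset` object: positive values indicate a long right tail, values near zero an approximately symmetric distribution.

```r
# Skewness of each numeric column: large positive values correspond to
# distributions whose mass sits at the low end of the range
numeric_cols <- sapply(dataset, is.numeric)
sort(sapply(dataset[, numeric_cols], e1071::skewness), decreasing = TRUE)
```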
With the summary of the dataset in mind, we can now check the distribution of our target variable across different factors. I selected three of them as interesting due to their importance in our daily lives:
grid.newpage()
plotbyGender <- ggplot(dataset,aes(Gender,fill=Attrition))+geom_bar() +labs(x="Gender", y="Employees")+ guides(fill=guide_legend(title=""))+ scale_fill_manual(values=c("black","#008000"))+ggtitle("By Gender")
plotbyOverTime <- ggplot(dataset,aes(OverTime,fill=Attrition))+geom_bar() +labs(x="OverTime", y="Employees")+ guides(fill=guide_legend(title=""))+ scale_fill_manual(values=c("black","#008000"))+ggtitle("By OverTime")
plotbyMaritalStatus <- ggplot(dataset,aes(MaritalStatus,fill=Attrition))+geom_bar() +labs(x="MaritalStatus", y="Employees")+ guides(fill=guide_legend(title=""))+ scale_fill_manual(values=c("black","#008000"))+ggtitle("By MaritalStatus")+theme(axis.text.x = element_text(angle=45))
grid.arrange(plotbyGender,plotbyOverTime,plotbyMaritalStatus,ncol=3)
More generally, we will also check the columns with a considerable number of factor levels, to see if they are relevant enough to be considered in our modelling. The columns in this case are:
qplot <- ggplot(dataset, aes(fill=EducationField, x=Attrition)) +
geom_bar(position="fill") +
ggtitle("EducationField by Attrition in %")
ggplotly(qplot)
We see differences in proportion for Marketing and Technical Degree as EducationField, as these fields are over-represented among employees with attrition; Medical, on the contrary, is over-represented among employees not considering a career change.
qplot <- ggplot(dataset, aes(fill=JobRole, x=Attrition)) +
geom_bar(position="fill") +
ggtitle("JobRole by Attrition in %")
ggplotly(qplot)
Regarding JobRole, Sales Representative and Laboratory Technician employees are more inclined towards a change in their job posts; on the contrary, Manager and Manufacturing Director are over-represented in the negative value of the target variable.
Now we do the same with a continuous variable. In this case, I have chosen Age as the dimension to pivot by, in order to get a clear idea of attrition across this variable.
qplot <- ggplot(dataset, aes(x = Age)) +
geom_density(aes(fill = Attrition), alpha = 0.5) +
ggtitle("Age Density per Attrition")
ggplotly(qplot)
We observe that people inclined towards attrition have a distribution concentrated at younger ages, with its mode around 30-31 years and much lower density for people older than 40. People not thinking about changing job posts tend to be older, with a much more spread-out age distribution, close to normal (though with a wider tail at older ages).
qplot <- ggplot(dataset, aes(x = MonthlyIncome)) +
geom_density(aes(fill = Attrition), alpha = 0.5) +
ggtitle("Monthly Income Density per Attrition")
ggplotly(qplot)
On the other hand, and as seems obvious, we can check the density of MonthlyIncome differentiated by Attrition too. Here we see that the salary density of workers thinking about quitting is heavily skewed towards lower salaries, with a very noticeable peak around $2,500 per month. Although people without attrition also have salaries concentrated around $2,500-6,000, the peak is far less pronounced and corresponds to the natural distribution of income in developed countries.
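The impressions from both density plots can be cross-checked with simple group statistics. A sketch using dplyr (loaded above), assuming the `dataset` object:

```r
# Median age and monthly income per Attrition group; the 'Yes' group
# should show lower values for both variables
dataset %>%
  group_by(Attrition) %>%
  summarise(median_age = median(Age),
            median_income = median(MonthlyIncome),
            employees = n())
```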
We need to prepare our data specifically to perform our modelling as efficiently as possible:
# Copying our dataset for changes to modelling
dataset_model <- dataset
# Converting binary columns to INT
dataset_model$Gender <- ifelse(dataset_model$Gender == 'Male', 1, 0)
dataset_model$OverTime <- ifelse(dataset_model$OverTime == 'Yes', 1, 0)
# Creating variables on EducationField and JobRole leaving the relevant values as booleans
dataset_model$MarketingTechnical <- ifelse(dataset_model$EducationField
%in% c('Marketing', 'Technical Degree'), 1, 0)
dataset_model$Medical <- ifelse(dataset_model$EducationField
%in% c('Medical'), 1, 0)
dataset_model$LaboratorySales <- ifelse(dataset$JobRole
%in% c('Laboratory Technician','Sales Representative'), 1, 0)
dataset_model$Director <- ifelse(dataset$JobRole
%in% c('Manager', 'Manufacturing Director'), 1, 0)
# Encoding rest of categorical columns
dv <- caret::dummyVars(" ~ BusinessTravel + Department + MaritalStatus", data = dataset_model)
new_columns <- data.frame(predict(dv, newdata = dataset_model))
dataset_model <- cbind(dataset_model, new_columns)
# Dropping columns with categorical values
dataset_model <- dataset_model[,c(-3, -5, -8,-14, -16, -38, -41, -44)]
# Scaling numerical columns
numerical_columns <- c('Age',
'DailyRate',
'DistanceFromHome',
'HourlyRate',
'MonthlyIncome',
'MonthlyRate',
'PercentSalaryHike',
'TotalWorkingYears',
'YearsAtCompany',
'YearsInCurrentRole',
'YearsWithCurrManager')
for (col in numerical_columns){
dataset_model[, col] <- scale(dataset_model[, col])
}
# Splitting into training and test
sample = sample.split(dataset_model$Attrition, SplitRatio = 2/3)
train = subset(dataset_model, sample == TRUE)
test = subset(dataset_model, sample == FALSE)
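`sample.split` stratifies on the variable passed to it, so the class balance of Attrition should be nearly identical in both partitions. A quick check, assuming the `train` and `test` objects created above:

```r
# Proportion of each Attrition class in the two partitions;
# both should match the full dataset's roughly 84/16 split
prop.table(table(train$Attrition))
prop.table(table(test$Attrition))
```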
As we are trying to predict a categorical value with a binary outcome, we will focus our efforts on models specialized in this kind of task. From simplest to most complex, we will fit a total of four different models: logistic regression, a C5.0 decision tree, an SVM, and a random forest.
# Making of the model
log_model <- glm(formula = Attrition~.,
family = 'binomial',
data = train)
# Predicting the Test set results
prob_pred = predict(log_model, type = 'response', newdata = test[-2])
log_pred = ifelse(prob_pred > 0.35, 1, 0)
# Getting the main indicators of our model
summary(log_model)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6640 -0.4735 -0.2354 -0.0844 3.3747
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.651810 2.010783 2.313 0.020699 *
## Age -0.186526 0.150773 -1.237 0.216039
## DailyRate -0.047300 0.111864 -0.423 0.672417
## DistanceFromHome 0.339339 0.112067 3.028 0.002462 **
## Education -0.092350 0.110806 -0.833 0.404597
## EnvironmentSatisfaction -0.566843 0.106835 -5.306 1.12e-07 ***
## Gender 0.340214 0.233158 1.459 0.144521
## HourlyRate 0.042092 0.112465 0.374 0.708202
## JobInvolvement -0.620997 0.155692 -3.989 6.65e-05 ***
## JobLevel -0.211723 0.389494 -0.544 0.586726
## JobSatisfaction -0.406853 0.100989 -4.029 5.61e-05 ***
## MonthlyIncome -0.329159 0.436445 -0.754 0.450740
## MonthlyRate -0.053648 0.112096 -0.479 0.632231
## NumCompaniesWorked 0.208151 0.048836 4.262 2.02e-05 ***
## OverTime 2.046247 0.246004 8.318 < 2e-16 ***
## PercentSalaryHike 0.008585 0.171671 0.050 0.960118
## PerformanceRating -0.198380 0.497478 -0.399 0.690061
## RelationshipSatisfaction -0.320773 0.103693 -3.093 0.001978 **
## StockOptionLevel -0.071710 0.192633 -0.372 0.709698
## TotalWorkingYears -0.469255 0.271122 -1.731 0.083489 .
## TrainingTimesLastYear -0.216489 0.091676 -2.361 0.018203 *
## WorkLifeBalance -0.485512 0.160842 -3.019 0.002540 **
## YearsAtCompany 0.427890 0.299202 1.430 0.152686
## YearsInCurrentRole -0.561984 0.207100 -2.714 0.006656 **
## YearsSinceLastPromotion 0.213513 0.056319 3.791 0.000150 ***
## YearsWithCurrManager -0.397818 0.220935 -1.801 0.071764 .
## MarketingTechnical 0.743286 0.292840 2.538 0.011142 *
## Medical -0.230989 0.270058 -0.855 0.392369
## LaboratorySales 0.920166 0.268413 3.428 0.000608 ***
## Director 0.322959 0.458868 0.704 0.481547
## BusinessTravel.Non.Travel -1.058429 0.467620 -2.263 0.023609 *
## BusinessTravel.Travel_Frequently 1.185401 0.262721 4.512 6.42e-06 ***
## Department.Human.Resources 0.425736 0.568342 0.749 0.453806
## Department.Research...Development -0.568613 0.275673 -2.063 0.039147 *
## MaritalStatus.Divorced -1.251393 0.435912 -2.871 0.004095 **
## MaritalStatus.Married -1.138989 0.317375 -3.589 0.000332 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 865.72 on 979 degrees of freedom
## Residual deviance: 551.66 on 944 degrees of freedom
## AIC: 623.66
##
## Number of Fisher Scoring iterations: 6
# Making the Confusion Matrix (log_pred is already binary after thresholding at 0.35)
log_cm = table(test[, 2], log_pred == 1)
log_cm
##
## FALSE TRUE
## No 379 32
## Yes 36 43
# Calculating the accuracy, the precision and the recall
log_acc = (log_cm[1,1] + log_cm[2,2]) / sum(log_cm)
log_pre = log_cm[2, 2] / (log_cm[2,2] + log_cm[1,2])
log_rec = log_cm[2, 2] / (log_cm[2,2] + log_cm[2,1])
log_stats = rbind(accuracy = log_acc,
precision = log_pre,
recall = log_rec)
log_stats
## [,1]
## accuracy 0.8612245
## precision 0.5733333
## recall 0.5443038
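As a sanity check, the standard metric definitions can be applied directly to the confusion matrix printed above, taking ‘Yes’ as the positive class: accuracy = (TP + TN) / N, precision = TP / (TP + FP), recall = TP / (TP + FN). A self-contained sketch with the values read off that matrix:

```r
# Values from the logistic regression confusion matrix above
TN <- 379; FP <- 32; FN <- 36; TP <- 43

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
round(c(accuracy = accuracy, precision = precision, recall = recall), 7)
# → accuracy 0.8612245, precision 0.5733333, recall 0.5443038
```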
# Making of the model
tree_model <- C50::C5.0(train[,-2], train[,2], rules=FALSE)
# Showing model stats
summary(tree_model)
##
## Call:
## C5.0.default(x = train[, -2], y = train[, 2], rules = FALSE)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Feb 26 23:55:45 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 980 cases (36 attributes) from undefined.data
##
## Decision tree:
##
## TotalWorkingYears <= -1.321152:
## :...BusinessTravel.Travel_Frequently > 0: Yes (10/1)
## : BusinessTravel.Travel_Frequently <= 0:
## : :...TrainingTimesLastYear <= 1: Yes (4)
## : TrainingTimesLastYear > 1:
## : :...MaritalStatus.Divorced > 0: No (10)
## : MaritalStatus.Divorced <= 0:
## : :...MarketingTechnical > 0: Yes (8/1)
## : MarketingTechnical <= 0:
## : :...StockOptionLevel > 0: No (5)
## : StockOptionLevel <= 0:
## : :...OverTime > 0: Yes (5)
## : OverTime <= 0:
## : :...WorkLifeBalance > 2: No (13/1)
## : WorkLifeBalance <= 2:
## : :...WorkLifeBalance <= 1: No (2)
## : WorkLifeBalance > 1: Yes (4)
## TotalWorkingYears > -1.321152:
## :...OverTime <= 0:
## :...WorkLifeBalance <= 1:
## : :...MaritalStatus.Married > 0: No (17/1)
## : : MaritalStatus.Married <= 0:
## : : :...Education > 3: Yes (4)
## : : Education <= 3:
## : : :...EnvironmentSatisfaction <= 2: Yes (4/1)
## : : EnvironmentSatisfaction > 2: No (6)
## : WorkLifeBalance > 1:
## : :...BusinessTravel.Travel_Frequently <= 0: No (513/32)
## : BusinessTravel.Travel_Frequently > 0:
## : :...DailyRate > -1.13129: No (100/11)
## : DailyRate <= -1.13129:
## : :...PerformanceRating > 3: No (5)
## : PerformanceRating <= 3:
## : :...Director <= 0: Yes (9/1)
## : Director > 0: No (2)
## OverTime > 0:
## :...MonthlyIncome > -0.6367797:
## :...EnvironmentSatisfaction > 2: No (129/12)
## : EnvironmentSatisfaction <= 2:
## : :...NumCompaniesWorked > 8: Yes (4)
## : NumCompaniesWorked <= 8:
## : :...JobSatisfaction > 3: No (24/2)
## : JobSatisfaction <= 3:
## : :...JobInvolvement <= 1: Yes (4)
## : JobInvolvement > 1:
## : :...MarketingTechnical > 0: Yes (4/1)
## : MarketingTechnical <= 0:
## : :...DistanceFromHome <= 1.45648:
## : :...MonthlyIncome <= -0.495317: Yes (2)
## : : MonthlyIncome > -0.495317: No (20)
## : DistanceFromHome > 1.45648:
## : :...MonthlyRate <= 0.6860134: Yes (6)
## : MonthlyRate > 0.6860134: No (2)
## MonthlyIncome <= -0.6367797:
## :...BusinessTravel.Travel_Frequently > 0: Yes (17/3)
## BusinessTravel.Travel_Frequently <= 0:
## :...JobInvolvement > 3: No (5)
## JobInvolvement <= 3:
## :...JobInvolvement <= 1: Yes (4)
## JobInvolvement > 1:
## :...YearsInCurrentRole <= -0.8912861: Yes (5)
## YearsInCurrentRole > -0.8912861:
## :...Department.Human.Resources > 0: No (2)
## Department.Human.Resources <= 0:
## :...MaritalStatus.Married > 0: No (10/2)
## MaritalStatus.Married <= 0:
## :...YearsSinceLastPromotion > 4: Yes (2)
## YearsSinceLastPromotion <= 4:
## :...RelationshipSatisfaction > 3: No (7/1)
## RelationshipSatisfaction <= 3: [S1]
##
## SubTree [S1]
##
## Department.Research...Development <= 0: Yes (2)
## Department.Research...Development > 0:
## :...DistanceFromHome <= -0.8872132: No (3)
## DistanceFromHome > -0.8872132: Yes (7/1)
##
##
## Evaluation on training data (980 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 38 71( 7.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 813 9 (a): class No
## 62 96 (b): class Yes
##
##
## Attribute usage:
##
## 100.00% TotalWorkingYears
## 96.22% OverTime
## 76.94% BusinessTravel.Travel_Frequently
## 69.29% WorkLifeBalance
## 26.43% MonthlyIncome
## 20.92% EnvironmentSatisfaction
## 11.84% DailyRate
## 8.67% JobInvolvement
## 7.24% MarketingTechnical
## 6.73% NumCompaniesWorked
## 6.33% JobSatisfaction
## 6.33% MaritalStatus.Married
## 5.20% TrainingTimesLastYear
## 4.80% MaritalStatus.Divorced
## 4.08% DistanceFromHome
## 3.88% YearsInCurrentRole
## 3.37% Department.Human.Resources
## 2.96% StockOptionLevel
## 2.14% YearsSinceLastPromotion
## 1.94% RelationshipSatisfaction
## 1.63% PerformanceRating
## 1.43% Education
## 1.22% Department.Research...Development
## 1.12% Director
## 0.82% MonthlyRate
##
##
## Time: 0.0 secs
# Predicting test data
tree_pred <- predict(tree_model, newdata = test)
# Making the Confusion Matrix
tree_cm = table(test[, 2], tree_pred)
tree_cm
## tree_pred
## No Yes
## No 375 36
## Yes 55 24
# Calculating the accuracy, the precision and the recall
tree_acc = (tree_cm[1,1] + tree_cm[2,2]) / sum(tree_cm)
tree_pre = tree_cm[2, 2] / (tree_cm[2,2] + tree_cm[1,2])
tree_rec = tree_cm[2, 2] / (tree_cm[2,2] + tree_cm[2,1])
tree_stats = rbind(accuracy = tree_acc,
precision = tree_pre,
recall = tree_rec)
tree_stats
## [,1]
## accuracy 0.8142857
## precision 0.4000000
## recall 0.3037975
# Making of the model
svm_model = svm(formula = Attrition ~ .,
data = train,
type = 'C-classification',
kernel = 'radial')
# Predicting the Test set results
svm_pred = predict(svm_model, newdata = test[-2])
# Making the Confusion Matrix
svm_cm = table(test[, 2], svm_pred)
svm_cm
## svm_pred
## No Yes
## No 405 6
## Yes 65 14
# Calculating the accuracy, the precision and the recall
svm_acc = (svm_cm[1,1] + svm_cm[2,2]) / sum(svm_cm)
svm_pre = svm_cm[2, 2] / (svm_cm[2,2] + svm_cm[1,2])
svm_rec = svm_cm[2, 2] / (svm_cm[2,2] + svm_cm[2,1])
svm_stats = rbind(accuracy = svm_acc,
precision = svm_pre,
recall = svm_rec)
svm_stats
## [,1]
## accuracy 0.8551020
## precision 0.7000000
## recall 0.1772152
# Making of the model
rfor_model = randomForest(x = train[-2],
y = train$Attrition,
ntree = 500)
# Predicting the Test set results
rfor_pred = predict(rfor_model, newdata = test[-2])
# Making the Confusion Matrix
rfor_cm = table(test[, 2], rfor_pred)
rfor_cm
## rfor_pred
## No Yes
## No 409 2
## Yes 68 11
# Calculating the accuracy, the precision and the recall
rfor_acc = (rfor_cm[1,1] + rfor_cm[2,2]) / sum(rfor_cm)
rfor_pre = rfor_cm[2, 2] / (rfor_cm[2,2] + rfor_cm[1,2])
rfor_rec = rfor_cm[2, 2] / (rfor_cm[2,2] + rfor_cm[2,1])
rfor_stats = rbind(accuracy = rfor_acc,
precision = rfor_pre,
recall = rfor_rec)
rfor_stats
## [,1]
## accuracy 0.8571429
## precision 0.8461538
## recall 0.1392405
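To close, the four metric vectors computed above can be bound into a single table for a side-by-side comparison of the models. A minimal sketch, assuming the `*_stats` objects from the previous sections:

```r
# One column per model, one row per metric
model_stats <- cbind(log_stats, tree_stats, svm_stats, rfor_stats)
colnames(model_stats) <- c('logistic', 'tree', 'svm', 'randomForest')
model_stats
```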